UnShuffle Sort is a sort algorithm.
Contents |
The UnShuffle Sort is a distribution or merge sort which was developed by Art S. Kagel. UnShuffle is most efficient when sorting data which exhibits low entropy, in effect where the data is well ordered or contains sub-sequences of ordered items. It was first mentioned in an article in Computer Language Magazine (Vol. 3 No. 11, November 1985). The current implementation is the result of several years of additional experimentation. The sort involves two phases. During the first distribution phase items are distributed to a variable number of ordered lists using a structure that minimizes the number of comparisons required. Once all items have been distributed the sorted lists are merged to the output list. UnShuffle is one of a very few sorts that can be applied directly to linked lists.
The algorithm utilizes a data structure dubbed a 'pile' which is an ordered deque which permits elements to be added to the head of the list if the new item is valued less than or equal to the current head or to the tail of the list if the new item is greater than or equal to the current tail. It is not permitted to insert elements between existing elements. This is normally implemented as a doubly linked list of lists with pointers to the head and tail of the doubly linked list containing the data items on the pile and to the next and previous pile. To keep things simple some normal optimizations are not included in the description of the algorithm but are presented at the end. Details of removing items from the input stream/list and appending items to the output stream/list are omitted as implementation specific.
Best case behavior for UnShuffle degenerates into a sort check for a fully ordered or ordered but reversed data set with only one pile created using only N-1 comparisons (with the recommended optimization) and no merge necessary. There is no worst case behavior, it can be shown that no data set will perform worse than the Order of the sort which is O((K/4)*N) for Phase I + O (N*(log K)) for Phase II using an Ideal Merge for Phase II where K is the number of piles created during the distribution phase (Phase I) and is proportional to the level of entropy in the data.
Because it sorts linked lists rather than data arrays only pointers are moved and because of the pile structure there are no expensive swaps performed so sort time is not dependent on the size of the data but only on the length and complexity of the key.
Use an Ideal Merge to merge the piles created in Phase I. Output to the output linked list or data stream. Ideal Merge is a merge algorithm Art S. Kagel believes he also invented which can be shown to be the best possible merge of sorted sinks (queues). Algorithm follows.
Keep track of whether the last element was placed on the top or bottom of some pile and begin comparison for the next element on the same side (i.e. if the last element was placed on the bottom of a pile begin comparisons with the bottom of the last pile rather than the top and switch to comparing to the top if the item is greater than the bottom of the last pile). For the general case and for highly ordered data also this is the single most important optimization to the basic algorithm saving, on average, N/2 comparisons and maximum of N-1 comparisons for reversed input.
Keep track of which pile the last element was appended or prepended to and begin comparison there. Intuitively this should be a good optimization and for certain data sets it may be, but testing has shown that the added complexity of the modified algorithm overshadows any benefit in the general case.
The general Ideal Merge algorithm can be enhanced by knowledge of the nature of the Distribution Phase:
When used as the merge for Unshuffle sorting, the creation of the list of queues and the initial sort of the piles is not needed since the distribution creates the piles ordered by their top element. In this case the merge begins with step #3.
Since runs of elements may be prepended to each pile before any other pile is prepended or appended to there is benefit to comparing the next item on the first pile to the top of the second pile, after having output the top element from the pile, to determine if the binary reordering search is needed yet, if not that element can be immediately output and this step repeated. (I've included this optimization in my description of the merge above.)
For data arriving in a stream, like a file sort or piped input, create an iterative version of the algorithm with an add_element() function and an output_element() function. The add_element would sort the single element into the piles structure. The first call to the output function will block further add_element() calls from modifying the data structures and begin the merge by outputting the first element on pile 1 and performing the pile reorder operation if needed. Each subsequent call will return the top of the first pile and reorder as needed.
Arrays can also be treated as a stream and subjected to the iterative sort rather than build a linked list from the array. Because of the function call overhead the benefit of this suggestion may be marginal at best but it suggests that a non-iterative version that treats the array as a stream may do even better. Hmm, an array sort version that substitutes an internal add-loop for the external loop calling an add-to-stream function should be fast enough to make direct array sorting feasible.
UnShuffle is not naturally a stable sort but one can force stability by appending the input record number to the end of the sort key.
Duplicate record elimination can be performed during both Phase I and Phase II or delayed until Phase II. The elimination accomplished during Phase I is partial only so the elimination in Phase II is still needed but processing time, and IO cost (if sort-work files are needed for larger data sets), is saved by doing both.
Because the second phase is a merge of sorted sinks, it is possible to apply the distribution phase in several independent parallel operations through separate processes or threads and then to apply the merge to the complete set of piles.
|